Assignment


Kevin Sunga
07-11-2021

Background

During 20-21 January 2014, several employees of GAStech, a Tethys-based gas multinational operating on the island country of Kronos, went missing. Who is missing? Where have they gone? Were they kidnapped? If so, who is responsible?

To get to the bottom of this mystery, I will use visual analytic techniques to analyze the data GAStech provided to assist with law enforcement's investigation, and hopefully find the missing persons and bring them home safely. The data cover the two weeks prior to the GAStech employees' disappearance and comprise:

- Car assignments for GAStech employees
- GPS tracking data for GAStech vehicles
- Credit card transaction data
- Loyalty card transaction data

Objective

The objective of this project is to use visual analytic techniques to surface and identify anomalies and suspicious behavior. More precisely, I aim to shed light on the following questions:

1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
4. Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.
5. Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.

Literature Review

We reviewed the submissions to VAST Challenge 2014 for a better appreciation of the approaches and techniques adopted to solve Mini-Challenge 2. We summarise below the methodologies we wish to consider for our project.

The DAC-MC2 team from Virginia Tech used a methodology called Points of Interest ("POI") to identify locations such as people's homes, their workplaces, and recreational spots (e.g. restaurants and cafes). A location is considered a POI if a person spends more than 5 minutes there and the location has a diameter of less than 50 metres. They then graphed the distribution of POIs over time for various days (weekdays and weekends) and location types (i.e. home, work, recreation).
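To make the POI rule concrete, here is a minimal sketch in R of how such a filter could work. This is my own illustration, not the DAC-MC2 team's code; the toy data and column names are hypothetical.

```r
# Illustrative sketch of the POI rule: a stop counts as a POI when the
# dwell time is at least 5 minutes and the stop's spatial extent is
# under 50 metres. Toy data; column names are hypothetical.
library(dplyr)
library(geosphere)

stops <- tibble::tibble(
  stop_id   = c(1, 2),
  lon_start = c(24.8700, 24.8900),
  lat_start = c(36.0500, 36.0700),
  lon_end   = c(24.8701, 24.8950),
  lat_end   = c(36.0501, 36.0710),
  minutes   = c(42, 3)
)

pois <- stops %>%
  rowwise() %>%
  # Haversine distance between the stop's first and last pings, metres
  mutate(diameter_m = distHaversine(c(lon_start, lat_start),
                                    c(lon_end, lat_end))) %>%
  ungroup() %>%
  filter(minutes >= 5, diameter_m < 50)
```

In practice the start and end points of a stop would come from consecutive GPS pings, and the diameter check would consider all pings within the stop, not just its endpoints.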

Similarly, the MiddlesexMASS-Attfield-MC2 team from Middlesex University used the Patterns of Life ("POL") suite to create a map showing where each person was at any given time and for how long. They also overlaid credit card and loyalty card transactions on their map.

To better understand the credit card and loyalty card data, the IIITH-YASHASWI-MC2 team from the International Institute of Information Technology Hyderabad visualized the distributions of credit card transactions by date, by person, and by employment title. This analysis enabled them to understand typical patterns and identify transactions that deviated from them.

The Purdue-Guo-MC2 team from Purdue University created a Social Relationship Matrix: a heatmap of the number of times GAStech employees met each other over the course of the two weeks. The assumption is that the more frequently people meet, the closer their relationship.
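As an illustration (not the Purdue team's actual code), such a matrix can be sketched by self-joining a long table of (meeting, employee) records and counting each unordered pair of employees; the toy data below are hypothetical.

```r
# Illustrative sketch of a meeting-count matrix: self-join a long table
# of (meeting, employee) records, count each unordered pair of
# employees, and plot the counts as a heatmap. Toy data.
library(dplyr)
library(ggplot2)

meetings <- tibble::tibble(
  meeting_id = c(1, 1, 2, 2, 2, 3, 3),
  employee   = c("A", "B", "A", "B", "C", "B", "C")
)

pair_counts <- meetings %>%
  inner_join(meetings, by = "meeting_id") %>%
  # Keep each unordered pair once and drop self-pairs
  filter(employee.x < employee.y) %>%
  count(employee.x, employee.y, name = "n_meetings")

p <- ggplot(pair_counts, aes(employee.x, employee.y, fill = n_meetings)) +
  geom_tile()
```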

Approach & Analysis

Set Up

The packages I use in my analysis are listed below.

#Load packages

packages = c("raster", "sf", "clock", "tmap",
             "tidyverse", "rgdal", "htmlwidgets", "leaflet",
             "geosphere", "sqldf", "ggwordcloud", "ggforce",
             "sessioninfo", "plotly", "lubridate", "viridis",
             "ggraph", "igraph", "ggiraph")

for (p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

I load the data used for the analysis.

# Import data

df_cars <- read_csv("data/car-assignments.csv")

df_gps <- read_csv("data/gps.csv")

df_cc <- read_csv("data/cc_data.csv")

df_loyalty <- read_csv("data/loyalty_data.csv")

df_meetups <- readRDS("data/df_meetups.rds")

Cleaning and Preparation

I noticed that Katerina's Café was causing issues in R, likely due to the apostrophe and the é character. I clean the location name to prevent encoding issues.

# Clean location names
df_cc <- df_cc %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

df_loyalty <- df_loyalty %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

I use date_time_parse from the clock R package to parse the timestamps into POSIXct date-time values.

# Format timestamps
df_cc$day_timestamp <- date_time_parse(df_cc$timestamp,
                                       zone = "",
                                       format = "%m/%d/%Y")

df_cc$timestamp <- date_time_parse(df_cc$timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")

df_loyalty <- df_loyalty %>%
  mutate(timestamp = date_time_parse(timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y")) %>%
  rename(day_timestamp = timestamp)

df_gps$day_timestamp <- date_time_parse(df_gps$Timestamp,
                                        zone = "",
                                        format = "%m/%d/%Y")

df_gps$Timestamp <- date_time_parse(df_gps$Timestamp,
                                    zone = "",
                                    format = "%m/%d/%Y %H:%M")

I use get_hour from the clock R package and wday from the lubridate R package to derive hour of day and day of week. I later use these fields for visualizing various time dimensions.

# Get various dates and times
df_cc$hour_of_day <- get_hour(df_cc$timestamp)

df_cc$day_of_week <- wday(df_cc$timestamp, label = TRUE, abbr = FALSE)

df_loyalty$day_of_week <- wday(df_loyalty$day_timestamp,
                               label = TRUE,
                               abbr = FALSE)

df_gps$hour_of_day <- get_hour(df_gps$Timestamp)

df_gps$day_of_week <- wday(df_gps$Timestamp, label = TRUE, abbr = FALSE)

I use the sf R package to turn the longitude and latitude coordinate points into geometries.

# Format coordinates
df_gps <- st_as_sf(df_gps,
                   coords = c("long", "lat"),
                   crs = 4326)

I join the credit card data with the loyalty card data, and the car assignments data with the GPS data. Since the two card data sets carry no shared transaction identifier, I match records that occur on the same date, at the same location, and for the same price, on the assumption that such a pair represents the same purchase.

# Join credit card and loyalty data
df_cc_loyalty <- df_cc %>%
  inner_join(df_loyalty, by = c("day_timestamp", "location", "price"))

# Join gps and cars data
df_gps_cars <- df_gps %>%
  left_join(df_cars, by = c("id" = "CarID"))

Analysis

Q1

To determine the most popular locations, I use interactive heatmap visualizations of credit card and loyalty card data. I visualize this data by location and across three time dimensions where possible: hour of day, date, and day of week.

Note: For brevity, I only show and discuss the code for one heatmap. The general approach I've taken is the same across heatmaps.

I create these global variables to sort the locations on the y-axis of the heatmaps.

# Create global variables. Later used for sorting
cc_locations <- unique(df_cc$location)
cc_locations <- sort(cc_locations, decreasing = TRUE)

loyalty_locations <- unique(df_loyalty$location)
loyalty_locations <- sort(loyalty_locations, decreasing = TRUE)

To generate interactive heatmaps, I use the ggplot2 and plotly R packages. First, I prepare the data by counting the number of transactions (rows); this count will later determine the fill or shading in the heatmap. Next, I pass the data to ggplot, where I set up the graph and customize the aesthetics. Lastly, I pass the ggplot object to plotly to make the graph interactive. The code below is an example snippet.

# Hour of day
df_hour_of_day <- df_cc %>%
  count(location, hour_of_day) %>%
  mutate(location = as.factor(location),
         hour_of_day = as.factor(hour_of_day),
         text = paste0("Location: ", location, "\n", "Hour: ", hour_of_day, "\n", "n: ", n))

hm_hour_of_day <- ggplot(df_hour_of_day, aes(hour_of_day,
                                             location,
                                             fill = n,
                                             text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete(limits = cc_locations) +
  scale_x_discrete() +
  ggtitle("# of Credit Card Transactions by Hour of Day") +
  xlab("Hour of Day") +
  theme(panel.grid.major = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6),
        axis.title.y = element_blank())
ggplotly(hm_hour_of_day, tooltip = "text")

Hour of Day

Note: Timestamps in the loyalty data set did not have hours and minutes. Therefore, I was unable to produce a heatmap for loyalty data by hour of day.

Date

Day of Week

TO DO: Add anomalies and corrections

TO DO: Discrepancy with location data

Q2

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Discrepancies Between Credit Card and Loyalty Card Data

First, I’m going to tackle the discrepancies between credit and loyalty card data.

I prepare the data by counting credit card and loyalty card records by location and date. I then join the two data sets on location and day_timestamp. Once joined, I calculate the difference between the number of credit card records and the number of loyalty card records; this difference will represent the fill or shading in the heatmap. Where the difference is 0, I turn the value into NA so it is filtered out of the heatmap.

# Prep data for heatmap diff
df_cc_helper <- df_cc %>%
  count(location, day_timestamp)

df_loyalty_helper <- df_loyalty %>%
  count(location, day_timestamp)

df_diff <- df_cc_helper %>%
  full_join(df_loyalty_helper, by = c("location", "day_timestamp")) %>%
  rename(n_cc = n.x,
         n_loyalty = n.y) %>%
  replace_na(list(n_cc = 0, n_loyalty = 0)) %>%
  mutate(diff = n_cc - n_loyalty,
         diff = na_if(diff, 0))

df_diff <- df_diff %>%
  filter(!is.na(diff)) %>%
  mutate(location = as.factor(location),
         day_timestamp = as.factor(day_timestamp),
         text = paste0("Location: ", location, "\n", "Date: ", day_timestamp, "\n", "Diff: ", diff))
# Create global variables. Later used for sorting
diff_locations <- unique(df_diff$location)
diff_locations <- sort(diff_locations, decreasing = TRUE)

Once the data is prepared, I use the same code as above to generate the following heatmap.
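For reference, the adapted call could look roughly like the snippet below, which mirrors the earlier heatmap pattern with diff as the fill. The df_diff stand-in here is a toy two-row table with made-up locations so the snippet runs on its own; in the analysis, df_diff and diff_locations come from the preparation step above.

```r
# Toy stand-in for df_diff so the snippet is self-contained;
# the real df_diff is built in the preparation step above
library(ggplot2)
library(viridis)
library(plotly)

df_diff <- data.frame(
  location      = factor(c("Cafe A", "Cafe B")),
  day_timestamp = factor(c("2014-01-06", "2014-01-07")),
  diff          = c(2, -1),
  text          = c("Diff: 2", "Diff: -1")
)
diff_locations <- sort(unique(as.character(df_diff$location)),
                       decreasing = TRUE)

# Same pattern as the earlier heatmap, with diff as the fill
hm_diff <- ggplot(df_diff, aes(day_timestamp,
                               location,
                               fill = diff,
                               text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete(limits = diff_locations) +
  ggtitle("# of Credit Card Minus Loyalty Card Records") +
  xlab("Date") +
  theme(axis.title.y = element_blank())
p <- ggplotly(hm_diff, tooltip = "text")
```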

Adding Vehicle Data

To compare the results from credit card and loyalty card data with vehicle data, I must georeference the map of Abila and determine what GPS coordinates to plot.

I used QGIS to georeference the map of Abila. I then load the georeferenced image using the raster R package.

bgmap <- raster("data/MC2-tourist_modified.tif")

To determine what GPS coordinates to plot, I take a page from the DAC-MC2 team from Virginia Tech and calculate "Points of Interest" (POIs). The tracking devices record a car's location periodically as long as the car is moving, so a gap in the data means the car must have stopped or parked. I define a POI as a point where the car was stopped for at least 5 minutes.

# Points of Interest (POIs): stops of at least 5 minutes
df_poi <- df_gps_cars %>%
  group_by(id) %>%
  arrange(Timestamp, .by_group = TRUE) %>%
  mutate(DepartureTimestamp = lead(Timestamp)) %>%
  mutate(Timestamp_diff_seconds = as.numeric(difftime(DepartureTimestamp,
                                                      Timestamp,
                                                      units = "secs"))) %>%
  mutate(is_poi = Timestamp_diff_seconds >= 60 * 5) %>%
  filter(is_poi) %>%
  ungroup()

df_poi <- rename(df_poi, ArrivalTimestamp = Timestamp)

I plot the POIs on the georeferenced map.

gps_dots_selected <- df_poi %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         day_timestamp,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         LastName,
         FirstName,
         CurrentEmploymentType,
         CurrentEmploymentTitle)
tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = 0.1, alpha = 0.3, jitter = 1) +
  tm_facets(by = "day_timestamp", ncol = 2)
m

TO DO: Figure out what variations of the map to show.

gps_dots_selected <- df_poi %>%
  filter(CurrentEmploymentType == "Security") %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         LastName,
         FirstName,
         CurrentEmploymentType,
         CurrentEmploymentTitle)
tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = 0.007, alpha = 0.3)
m


TO DO: Interactive data table + facet map

Q3

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

TO DO:

Q4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.

TO DO: Pull from 1-4

Q5

Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.

As a starting point, I build each car's movement path as a linestring so it can be traced on the georeferenced map.

# Build each car's movement path as a linestring
gps_path <- df_gps %>%
  group_by(id) %>%
  summarize(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")

gps_path_selected <- gps_path %>%
  filter(id == 1)
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
tm_shape(gps_path_selected) +
  tm_lines()
# Facets map
gps_dots_selected <- df_poi %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         LastName,
         FirstName,
         CurrentEmploymentType,
         CurrentEmploymentTitle)
tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = .1, alpha = 0.3, jitter = 1) +
  tm_facets(by="id", ncol = 2)
m